dplyr 2: grouping with group_by, summarize, and arrange
This interactive tutorial introduces parts of the dplyr package: group_by,summarize,and arrange functions. You can write and run code in your browser using WebR.
Setup
The data is from the Stanford Open Policing Project and includes a subset of vehicle stops by the police in Evanston, IL in 2017. We’re reading the data in from a URL directly.
Summarize
We’ll start with summarize() (or summarise() - British spelling is accepted). We use mutate() when we want the output to have the same length as the input – when we’re operating on the individual elements in a vector - we want a value for every row in the data. When we want to condense multiple values down to a single (or a few values), such as taking the mean or standard deviation of a vector), we use summarize instead.
Note that even though there’s just one value, we get a tibble returned. This is what to expect with the tidyverse.
As a side note, if we needed the single value (or a single vector), we could pull() it out of the tibble/data frame:
We can compute more than one summary measure at the same time:
We get one column per summary variable we create. Once we group below, we’ll see why we get the output in columns instead of rows.
EXERCISE 1
Use summarize to compute the min() and max() vehicle_year
Across
If we want to apply the same summary functions to multiple columns in our data frame, we can write out all of the summary commands explicitly, or we can use across() to select which variables to summarize with which functions.
Let’s use the n_distinct() function to count the number of distinct values in each column (n_distinct(x) is the same as length(unique(x)). This will help us see which columns don’t have useful information because every value is the same.
across() selects columns using the helper functions you could give to select() directly. We’ll use everything() here to select all columns.
If you wanted to select columns using their names, put them in a vector (so it’s a single input argument):
If we want to apply multiple functions:
To fix the names in the output, explicitly name our summary functions in the list:
There are other options for output naming patterns available too.
EXERCISE 2
What if we want to do something to each column that is more complicated than a single function? For example, counting missing values, which you do with sum(is.na(x)). You can write your own function!
Now use this new na_count() function to count the missing values in each column:
This is similar to using the apply() function in base R:
Advanced Aside
Instead of creating the function ahead of time, we could define it as part of the summarize call. We use .x to stand in for each column, and we put ~ in front of the expression to signal we’re creating a function in place (lambda function).
This is an advanced technique outside of the scope of this workshop. But if you see a ~ inside across(), at least know that someone is defining their own function inline.
Group By
With base R, when we want to compute summary measures or do other computation on groups in our data (as defined by some grouping variable), we use functions such as tapply() or aggregate(). With dplyr, we can explicitly group our tibble into subgroups. This isn’t very useful by itself, but it is often combined with summarize() to compute summary measures by group.
First, what if we just group:
When we print this in the console,
# A tibble: 14,792 x 29
# Groups: outcome [2]
raw_row_number date time location beat subject_age subject_race subject_sex department_id
<dbl> <date> <tim> <dbl> <chr> <lgl> <chr> <chr> <dbl>
1 11249746 2017-01-01 00:56 60202 72 NA white male 13178
2 11249747 2017-01-01 04:43 60643 71 NA black male 13178
3 11249748 2017-01-01 09:21 60625 78 NA black female 13178
...
We see that it tells us that the tibble (data frame) is grouped by outcome, and that there are two groups. It doesn’t rearrange the rows, it just keeps track of the groups for us.
Now, combine with summarize. But first, let’s make the vehicle_age column we’ve been using actually part of the police dataset so that we don’t have to keep creating it:
Now, group and summarize:
Now we get one row for each group, and one column for each summary measure.
We can group by multiple columns, and we’ll get all of the combinations of values present across the columns:
Let’s compute the ratio of warnings to citations by subject_race - note that we can use the variables we create in later expressions within the same call to summarize():
There’s considerable variation here, from 1.1 warnings for every citation given to 2.2 warnings for every citation given.
EXERCISE 3
Compute the min() and max() vehicle_year for each vehicle_make.
Ungrouping
If you ever have a grouped data frame, you may need to ungroup it to get rid of the groups. To do so, use ungroup():
Slicing
One operation we could do with a grouped tibble is to select just certain rows from each group. For example, we could use the slice() function to select the first row from each group:
If you look at this output in the console, you’ll see the resulting tibble still has groups in it. This is a case where you might want to ungroup:
Arrange
Finally, we come to arrange(), which is how we sort the rows in our data. We would mostly use this when viewing our data, but it’s also useful when we need to compute a time series (lags and leads in the data), when we want to select just a few rows from each group, or any other order-sensitive transformations on our data.
To sort in reverse order, wrap the column name in desc().
Arrange by multiple columns, in order:
An example where it matters: compute time between stops in the dataset:
EXERCISE 4
Sort the data by vehicle_make and then vehicle_year.
Count
A bonus function that I use frequently: count(). It’s how you’d get output similar to table()
By itself, it counts the number of rows:
If you supply the name of a column, it makes a table:
This is the same result as if you grouped the data first:
You can group by multiple columns directly with count:
EXERCISE 5
How many times does each type of violation appear in the dataset? Bonus: sort the results from most to least frequent. You can do this with arrange() or look at the documentation for count() to find another option.
Recap
We’ve now covered the core dplyr functions: six “verbs” of common actions we may take with a data frame. There are lots of helper functions, variations, and special cases that can come up when using these functions, but we’ve covered the essentials.
All of these functions were for working with a single data frame. Next exercise we’ll talk about dplyr functions that will help us combine two different data frames together.
To learn more about some issues in computing police bias statistics, see https://fivethirtyeight.com/features/why-statistics-dont-capture-the-full-extent-of-the-systemic-bias-in-policing/ or from Northwestern researchers Prof. Redbird and Kat Albrecht, https://redbird.shinyapps.io/police-bias-map/.
Acknowledgements
This tutorial was adapted by Brennan Antone from original tidyverse tutorial by Christina Maimone.